Diabetes, according to the World Health Organization, is a chronic disease that occurs when blood glucose levels are abnormally high, putting several body organs at risk, including the heart, eyes, kidneys, and nerves (World Health Organization 2020). There is currently no cure for diabetes, but an individual's likelihood of developing the disease can be predicted from medical indicators such as blood pressure, glucose level, insulin level, and genetics.
Recognising the importance of early diabetes diagnosis, this project utilised the Pima Indians Diabetes data set from the University of California, Irvine Machine Learning Repository to develop a supervised classification machine learning model. Through the application of data science, this initiative aims to assist medical practitioners in increasing the life expectancy of women in the community.
1.1 Variables Description
1.2 Checking number of rows and columns
1.3 Checking data types and distributions
1.4 Handling missing values
1.5 Detect Outliers
1.6 Conclusion
2.1 Exploratory Data Analysis (EDA)
2.1.1 Explore distribution of Non-Diabetic (0) and Diabetes (1) data in each independent parameter
2.1.2 Relationship between features
2.1.3 Formulate hypothesis
2.2 Identify learning problems for machine learning
3.1 K-Nearest Neighbour (KNN)
3.2 Decision Tree
3.3 Random Forest
4.1 Ratio selection
4.2 Data partitioning process
5.1 Expanding hyperparameters
5.2 K-fold Cross Validation
5.2.1 Splitting the training data set in section 4.2 into train and validation sets
5.2.2 Select number of KFold
5.2.3 Validation performance for each learning algorithm
5.2.4 Select best model performance
6.1 Hyperparameter tuning
6.2 Evaluate model accuracy score and confusion matrix
# Import Libraries
import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt
import seaborn as sns
import re
import missingno
import warnings
warnings.filterwarnings("ignore")
#Loading data set
project_data = pd.read_csv('diabetes.csv')
project_data
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 9 columns
Based on the information provided by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK, https://www.niddk.nih.gov/), which collected and contributed this data set, together with the analysis from Assignment 2, the following table describes each variable in the data set.
# Checking number of rows and columns
print("\nNumber of rows and columns in the Pima Indians Diabetes data set: ", project_data.shape)
Number of rows and columns in the Pima Indians Diabetes data set: (768, 9)
# Checking data types
project_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
# Checking data ranges
project_data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.000 | 1.00000 | 3.0000 | 6.00000 | 17.00 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.000 | 99.00000 | 117.0000 | 140.25000 | 199.00 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.000 | 62.00000 | 72.0000 | 80.00000 | 122.00 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.000 | 0.00000 | 23.0000 | 32.00000 | 99.00 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.000 | 0.00000 | 30.5000 | 127.25000 | 846.00 |
| BMI | 768.0 | 31.992578 | 7.884160 | 0.000 | 27.30000 | 32.0000 | 36.60000 | 67.10 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.000 | 24.00000 | 29.0000 | 41.00000 | 81.00 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.000 | 0.00000 | 0.0000 | 1.00000 | 1.00 |
1. Number of rows: 768.
2. Number of columns: 9 — 8 independent features and 1 label column (Outcome).
3. Data types: all columns are numeric (7 int64 and 2 float64), so no type conversion is required.
In conclusion, this data set is sufficient for developing supervised machine learning classification models.
#checking null values
project_data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
project_data.describe().T[['min', 'max']]
| | min | max |
|---|---|---|
| Pregnancies | 0.000 | 17.00 |
| Glucose | 0.000 | 199.00 |
| BloodPressure | 0.000 | 122.00 |
| SkinThickness | 0.000 | 99.00 |
| Insulin | 0.000 | 846.00 |
| BMI | 0.000 | 67.10 |
| DiabetesPedigreeFunction | 0.078 | 2.42 |
| Age | 21.000 | 81.00 |
| Outcome | 0.000 | 1.00 |
Although isnull() reports no missing values, the minimum of Glucose, BloodPressure, SkinThickness, Insulin, and BMI is 0, which is physiologically impossible. These zeros were most likely encoded from raw null values during the earlier data acquisition process. Consequently, to continue the investigation, we must convert the zeros to null.
#Replacing 0 to NaN value
project_data[['Glucose','BloodPressure','SkinThickness','BMI', 'Insulin']] = project_data[['Glucose','BloodPressure','SkinThickness','BMI', 'Insulin']].replace(0, np.nan)
project_data.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
def nulls_breakdown(project_data=project_data):
dataframe_cols = list(project_data.columns)
cols_total_count = len(list(project_data.columns))
cols_count = 0
for loc, col in enumerate(dataframe_cols):
null_count = project_data[col].isnull().sum()
total_count = project_data[col].isnull().count()
percent_null = round(null_count/total_count*100, 2)
if null_count > 0:
cols_count += 1
print('[iloc = {}] {} has {} null values: {}% null'.format(loc, col, null_count, percent_null))
cols_percent_null = round(cols_count/cols_total_count*100, 2)
print('Out of {} total columns, {} contain null values; {}% columns contain null values.'.format(cols_total_count, cols_count, cols_percent_null))
nulls_breakdown()
[iloc = 1] Glucose has 5 null values: 0.65% null
[iloc = 2] BloodPressure has 35 null values: 4.56% null
[iloc = 3] SkinThickness has 227 null values: 29.56% null
[iloc = 4] Insulin has 374 null values: 48.7% null
[iloc = 5] BMI has 11 null values: 1.43% null
Out of 9 total columns, 5 contain null values; 55.56% columns contain null values.
Of the five columns containing null values, SkinThickness (29.56%) and Insulin (48.7%) are missing a substantial share of their entries, while Glucose, BloodPressure, and BMI are each missing fewer than 5%.
To determine whether to replace these null values with the mean or the median, the next step examines the distribution of each column.
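Skewness offers a quick numeric check to complement the histograms: a roughly symmetric column can be filled with the mean, while a heavily skewed one is better served by the median. The sketch below uses a small hypothetical sample (not the project data) to illustrate the rule of thumb.

```python
import pandas as pd

# Hypothetical mini-sample; in practice the check would run on project_data.
sample = pd.DataFrame({
    'Glucose': [85, 90, 95, 100, 105],   # roughly symmetric values
    'Insulin': [15, 20, 25, 30, 500],    # heavily right-skewed values
})

# Rule of thumb: |skew| < 0.5 suggests near-symmetry (mean is reasonable);
# a larger |skew| suggests a long tail (median is more robust).
for col in sample.columns:
    skew = sample[col].skew()
    strategy = 'mean' if abs(skew) < 0.5 else 'median'
    print(f"{col}: skew = {skew:.2f} -> fill with the {strategy}")
```

The 0.5 threshold is only a convention; the histograms below remain the primary evidence for the fill strategy actually used.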
#Checking data distribution by histogram
project_data.hist(figsize=(17,15))
plt.show()
# Filling null values in Glucose, BloodPressure columns with the MEAN.
for col in ['Glucose', 'BloodPressure']:
project_data[col] = project_data[col].fillna(project_data[col].mean())
# Filling null values in 'Insulin', 'SkinThickness', 'BMI' columns with the MEDIAN.
for col in ['Insulin', 'SkinThickness', 'BMI']:
project_data[col] = project_data[col].fillna(project_data[col].median())
# Visualise missingness as a matrix using the missingno package to confirm that the columns no longer contain null values.
missingno.matrix(project_data)
plt.show()
The project data set no longer contains missing values.
# Detect outliers in the diabetes data set using sns.boxplot
fig, ax = plt.subplots(nrows = 5, ncols = 2, figsize = (14,8))
for column, subplot in zip(project_data, ax.flatten()):
    sns.boxplot(x = project_data[column], ax = subplot)
fig.suptitle('Detected outliers in the Pima diabetes data set (boxplots)', fontsize=17)
fig.tight_layout()
plt.show()
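The boxplots flag points beyond 1.5 times the interquartile range (IQR) as outliers. The same rule can be expressed in code; the sketch below is a minimal illustration on a tiny hypothetical column, not the full project data.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
    return mask.sum()

# Tiny hypothetical example: 846 falls far outside the upper fence (Q3 + 1.5*IQR).
demo = pd.DataFrame({'Insulin': [30, 80, 100, 120, 846]})
print(iqr_outlier_counts(demo))  # Insulin    1
```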
At the end of the data preprocessing stage, project_data has been modified so that:
This data set is now ready for the model development process.
# Check the distribution of Outcome (0 = Non-Diabetic, 1 = Diabetic) in each variable.
# sns.displot() is figure-level and ignores the ax argument, so the axes-level sns.histplot() is used instead.
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 14))
axes = axes.ravel()  # flatten the axes array to 1D
for column, ax in zip(project_data.columns, axes):
    sns.histplot(data=project_data, x=column, hue='Outcome', ax=ax)
fig.tight_layout()
plt.show()
pd.plotting.scatter_matrix(project_data, figsize=(20, 20))
plt.show()
corr = project_data.corr()
fig, ax = plt.subplots(figsize = (12,8))
sns.heatmap(corr, annot = True, linewidths = 0.5, ax = ax)
plt.title('Correlation Matrix Heatmap', fontsize=18, pad=18)
plt.show()
Glucose, BMI, Pregnancies, Diabetes Pedigree Function and Age are primary medical indicators that have a positive correlation with the development of diabetes. This can be discussed in detail as follows:
On the other hand, Blood Pressure, Insulin, and Skin Thickness show only weak correlations with the likelihood of developing diabetes.
Insulin, Skin Thickness, and Diabetes Pedigree Function all correlate negatively with Pregnancies. In particular, during pregnancy, insulin levels in a woman's body frequently decline because of a hormone generated by the placenta. Additionally, as the foetus grows, the skin of pregnant women stretches and thins; the more pregnancies a woman has, the less elastic her skin becomes (Australia Government - Department of Health 2019).
Furthermore, as previously mentioned, some women do not have diabetes before they get pregnant; they develop gestational diabetes only during pregnancy, and the majority of these women return to being nondiabetic after giving birth (Diabetes Australia 2020). Therefore, the inverse correlation between Pregnancies and Diabetes Pedigree Function implies that not all examined women in the Pima diabetes data set contain the hereditary diabetes gene.
In order to forecast the likelihood of diabetes in women based on the Pima diabetes data set, this project required the development of a SUPERVISED machine learning model employing CLASSIFICATION algorithms that meet the following criteria:
Two learning algorithms were carried over from Assignment 2.
One additional learning algorithm is introduced in this project.
Random Forest is an ensemble method that classifies by combining a large number of independently operating decision trees (Data Camp 2016). Each tree in the forest is trained on a bootstrap sample of the data (the bagging technique) and produces a class prediction; the class with the greatest number of votes becomes the model's prediction (Data Camp 2016).
Random Forest is the selected algorithm because it is insensitive to outliers in the data set, satisfying the machine learning requirements outlined in subsection 2.2. In addition, the bagging technique enables Random Forest to overcome the low-bias, high-variance behaviour of Assignment 2's Decision Tree algorithm, which caused the Decision Tree to overfit the data (Data Camp 2016).
RandomForestClassifier() is the function that implements the Random Forest learning algorithm in the Scikit-learn library.
Hyperparameters: max_depth, n_estimators
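As a quick illustration of these two hyperparameters, the sketch below fits a RandomForestClassifier on toy data; the data, hyperparameter values, and variable names here are hypothetical, not taken from the project itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.random((40, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

# n_estimators = number of bagged trees; max_depth caps each tree's depth.
rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
rf.fit(X_toy, y_toy)
print(rf.score(X_toy, y_toy))  # training accuracy on the toy data
```

More trees reduce the variance of the ensemble, while a smaller max_depth constrains each individual tree, which is why both appear in the tuning grid later in the report.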
# Packages for Data Modelling Process
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
Most researchers on ResearchGate (topic and topic) and data scientists on Stack Overflow suggest that 70:30 and 80:20 are the most commonly used ratios for data partitioning in machine learning.
Because the training set must be sufficiently large to support k-fold cross validation, the 80:20 ratio, commonly associated with the Pareto principle, has been chosen for this report. The data-partitioning procedure therefore comprises two steps:
Satisfying machine learning task 2.2 - Selecting the eight features Glucose, BMI, Pregnancies, Diabetes Pedigree Function, Age, Blood Pressure, Insulin, and Skin Thickness to develop the model
x = project_data.drop('Outcome', axis = 1)
y = project_data['Outcome']
#Checking scale of independent features
x.describe().T[['min', 'max']]
| | min | max |
|---|---|---|
| Pregnancies | 0.000 | 17.00 |
| Glucose | 44.000 | 199.00 |
| BloodPressure | 24.000 | 122.00 |
| SkinThickness | 7.000 | 99.00 |
| Insulin | 14.000 | 846.00 |
| BMI | 18.200 | 67.10 |
| DiabetesPedigreeFunction | 0.078 | 2.42 |
| Age | 21.000 | 81.00 |
The eight independent features all have different ranges, so we normalise them to the interval [0, 1].
Satisfying machine learning task 2.2 - Maintaining the data distribution in each independent parameter
#MinMaxScaler has default feature_range=(0, 1)
scaler = MinMaxScaler()
# fit_transform learns each feature's min/max and rescales to [0, 1] in one step
X = scaler.fit_transform(x)
X
array([[0.35294118, 0.67096774, 0.48979592, ..., 0.31492843, 0.23441503,
0.48333333],
[0.05882353, 0.26451613, 0.42857143, ..., 0.17177914, 0.11656704,
0.16666667],
[0.47058824, 0.89677419, 0.40816327, ..., 0.10429448, 0.25362938,
0.18333333],
...,
[0.29411765, 0.49677419, 0.48979592, ..., 0.16359918, 0.07130658,
0.15 ],
[0.05882353, 0.52903226, 0.36734694, ..., 0.24335378, 0.11571307,
0.43333333],
[0.05882353, 0.31612903, 0.46938776, ..., 0.24948875, 0.10119556,
0.03333333]])
Satisfying machine learning task 2.2 - Preserving the imbalance of classes in the Outcome column after partitioning the dataset into train and test sets
As stated in subsection 2.1, there is an imbalance between the number of diabetic and non-diabetic instances in the label column of the Pima diabetes data set. Therefore, one objective of machine learning task 2.2 is to divide the dataset into train and test sets while preserving the same proportions of samples in each class of the label column as in the original dataset.
This is possible by utilising the train_test_split() function and setting the stratify argument to the y component of the initial dataset (Scikit learn Documentation).
# split imbalanced dataset into train and test sets with stratification, ratio 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(614, 8) (154, 8) (614,) (154,)
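To confirm that stratification preserved the class proportions, value_counts(normalize=True) can be compared across the splits. The sketch below uses a hypothetical 65/35 label vector standing in for Outcome, not the project data itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced label vector mirroring Outcome's roughly 65/35 split.
y_demo = pd.Series([0] * 65 + [1] * 35)
X_demo = pd.DataFrame({'f': range(100)})

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# With stratify set, both splits keep the original 0.65/0.35 class ratio.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Running the same check on y_train and y_test above would show the Outcome proportions of the original data set preserved in both splits.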
Three candidate values are proposed for each of n_estimators and max_depth, starting from:
n_estimators = 200
max_depth = 18
# split the training dataset into train set and validation set with stratification, ratio 80/20
X_train1, X_validate1, y_train1, y_validate1 = train_test_split(X_train, y_train, test_size = 0.2, random_state = 42, stratify = y_train)
print(X_train1.shape, X_validate1.shape, y_train1.shape, y_validate1.shape)
(491, 8) (123, 8) (491,) (123,)
As described in the book An Introduction to Statistical Learning, 5 and 10 are the most commonly used numbers of folds; these values have been shown to yield a good balance between bias and variance.
k = 5 folds will be used in this project.
# define 5-fold cross validation
kfold = KFold(n_splits=5)
# list of expanded hyperparameter values
n_neighbors_list = [20,30,40,50,60]
for i in n_neighbors_list:
    # train a model on the train set for each proposed hyperparameter value
    knn = KNeighborsClassifier(n_neighbors = i).fit(X_train1,y_train1)
    # evaluate performance on the validation set
    scores1 = cross_val_score(knn, X_validate1,y_validate1, cv=kfold)
    # mean score across the 5 folds (as_1 retains the score for the last value in the list)
    as_1 = scores1.mean()
print("- Average accuracy score over 5 folds: " + str(as_1))
- Average accuracy score over 5 folds: 0.6993333333333334
# define 5-fold cross validation
kfold = KFold(n_splits=5)
# list of expanded hyperparameter values
max_depth_list = [1,2,3,4,5]
for i in max_depth_list:
    # train a model on the train set for each proposed hyperparameter value
    dt = DecisionTreeClassifier(max_depth = i).fit(X_train1,y_train1)
    # evaluate performance on the validation set
    scores2 = cross_val_score(dt, X_validate1,y_validate1, cv=kfold)
    # mean score across the 5 folds (as_2 retains the score for the last value in the list)
    as_2 = scores2.mean()
print("- Average accuracy score over 5 folds: " + str(as_2))
- Average accuracy score over 5 folds: 0.7236666666666667
# define 5-fold cross validation
kfold = KFold(n_splits=5)
# lists of expanded hyperparameter values
n_estimators_list = [160,180,200]
max_depth_list = [18,20,22]
# zip pairs the candidates as (160,18), (180,20), (200,22) rather than testing all nine combinations
for i, j in zip(n_estimators_list, max_depth_list):
    # train a model on the train set for each proposed hyperparameter pair
    rd = RandomForestClassifier(n_estimators = i, max_depth = j).fit(X_train1,y_train1)
    # evaluate performance on the validation set
    scores3 = cross_val_score(rd, X_validate1,y_validate1, cv=kfold)
    # mean score across the 5 folds (as_3 retains the score for the last pair in the list)
    as_3 = scores3.mean()
print("- Average accuracy score over 5 folds: " + str(as_3))
- Average accuracy score over 5 folds: 0.8293333333333333
# combine the accuracy scores of the three models into a data frame for visualisation
model_cp = {}
model_cp['K-nearest Neighbors'] = [as_1]
model_cp['Decision Tree'] = [as_2]
model_cp['Random Forest'] = [as_3]
model_cp_df = pd.DataFrame.from_dict(model_cp).T
model_cp_df.columns = ['Average accuracy score']
model_cp_df
| | Average accuracy score |
|---|---|
| K-nearest Neighbors | 0.699333 |
| Decision Tree | 0.723667 |
| Random Forest | 0.829333 |
# visualise the accuracy scores of the three models as a bar chart
model_cp_df.plot(kind="barh", figsize=(12, 10))
plt.title("Classification Models Comparison", fontsize=18, pad=20)
plt.xlabel("Average accuracy score", fontsize=13)
plt.ylabel("Type of Classification Model", fontsize=13)
plt.show()
Random Forest is the most effective learning algorithm, with an average accuracy score over the folds of approximately 0.83, followed by Decision Tree at approximately 0.72 and K-nearest Neighbors at approximately 0.70.
One explanation is that distance-based models such as K-nearest Neighbors are highly sensitive to outliers, so data sets containing outliers, such as the Pima diabetes data set, will often not yield reliable predictions from them.
Random Forest is the final model chosen to be applied to a test data set in the next section.
#Step 1: Find the best value for hyperparameter of RandomForest model
tuned_parameters = {'n_estimators': [160,180,200], 'max_depth':[18,20,22]}
#define kfold = 5
kfold = KFold(n_splits=5)
#specify model
model = RandomForestClassifier()
#using GridSearchCV() to find best hyperparameter
grid = GridSearchCV(estimator=model, param_grid=tuned_parameters , cv=kfold)
grid.fit(X_train, y_train)
print("- Best cv_scores: " + str(grid.best_score_))
print("- Best parameter: " + str(grid.best_params_))
- Best cv_scores: 0.7671331467413035
- Best parameter: {'max_depth': 22, 'n_estimators': 160}
#building model with best value of hyperparameter
#train model with train data
rd = RandomForestClassifier(max_depth=22, n_estimators=160).fit(X_train,y_train)
#evaluate performance on test data
y_test_pred = rd.predict(X_test)
print (f"1/ Confusion Matrix: \n {confusion_matrix(y_test, y_test_pred)}")
print(f"2/ Accuracy Score: {accuracy_score(y_test, y_test_pred)}")
print(f"3/ F1 Score: {f1_score(y_test, y_test_pred, average='weighted')}")
print(f"4/ Classification Report: \n {classification_report(y_test, y_test_pred)}")
1/ Confusion Matrix:
[[83 17]
[23 31]]
2/ Accuracy Score: 0.7402597402597403
3/ F1 Score: 0.7364029459974635
4/ Classification Report:
precision recall f1-score support
0 0.78 0.83 0.81 100
1 0.65 0.57 0.61 54
accuracy 0.74 154
macro avg 0.71 0.70 0.71 154
weighted avg 0.73 0.74 0.74 154
# visualise the confusion matrix as a heatmap
sns.set(font_scale=1.5)
# confusion_matrix() places true labels on rows and predicted labels on columns
sns.heatmap(confusion_matrix(y_test, y_test_pred), annot=True, cbar=False, fmt='g')
plt.xlabel("Predicted label")
plt.ylabel("True label");
# original shape of the label test set
y_test.shape
(154,)
# attach the predictions as a 'Prediction' column alongside the true labels
# (assigning y_test_pred directly to the Series y_test would corrupt it, so convert to a DataFrame first)
results = y_test.to_frame(name='Outcome')
results['Prediction'] = y_test_pred
results.shape
(154, 2)
import pickle
pickle.dump(rd, open('model.pkl', 'wb'))
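The pickled model can later be reloaded, for example in a deployment script. The sketch below trains a small stand-in model so it is self-contained; the file name model.pkl matches the cell above, but the data and model here are hypothetical.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train and persist a stand-in model so the snippet runs on its own.
X_demo = np.random.default_rng(1).random((30, 8))
y_demo = (X_demo[:, 1] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_demo, y_demo)
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Reload the model and predict, as a deployment script would.
with open('model.pkl', 'rb') as f:
    loaded = pickle.load(f)
preds = loaded.predict(X_demo[:5])
print(preds.shape)  # (5,)
```

Note that pickle files should only be loaded from trusted sources, since unpickling executes arbitrary code.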
Although Random Forest handles outliers well, like other tree-based models it has the potential to overfit the training data. Some suggestions to enhance the model:
[2] https://www.diabetesaustralia.com.au/
[3] https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
[4] https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
[5] https://www.datacamp.com/courses/machine-learning-with-tree-based-models-in-python